ABSTRACT



A method for determining k nearest-neighbors to a query
point in a database in which an ordering is defined for a data
set P of a database, the ordering being based on l one-
dimensional codes C_1, . . . , C_l. A single relation R is created
in which R has the attributes of index-id, point-id and value.
An entry (j, i, C_{ε_j}(p_i)) is included in relation R for each
data point p_i ∈ P, where index-id equals j, point-id equals i, and
value equals C_{ε_j}(p_i). A B-tree index is created based on a
combination of the index-id attribute and the value attribute.
A query point is received and a relation Q is created for the
query point having the attributes of index-id and value. One
tuple is generated in the relation Q for each j, j = 1, . . . , l,
where index-id equals j and value equals C_{ε_j}(q). A distance
d is selected. The index-id attribute for the relation R of each
data point p_i is compared to the index-id attribute for the
relation Q of the query point. A candidate data point p_i is
selected when the comparison of the relation R of a data
point p_i to the index-id attribute for the relation Q of the
query point is less than the distance d. Lower bounds are
calculated for each cube of the plurality of cubes that
represent a minimum distance between any point in a cube
and the query point. Lastly, k candidate data points p_i are
selected as k nearest-neighbors to the query point.

2 Claims, 1 Drawing Sheet 
METHOD FOR COMPUTING NEAR
NEIGHBORS OF A QUERY POINT IN A
DATABASE

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

The present invention relates to the field of computing. 
More particularly, the present invention relates to a method 
for processing queries in a database management system. 

2. Description of the Related Art 

Imagine a database DB consisting of points from S = D_1
× . . . × D_n, where D_i ⊆ R. For present purposes, the following
discussion will be restricted to the case of S = D^n, where D is
the set of rational numbers in [0,1) having denominators that
are powers of 2, although the results can be extended to a
general case. Each D_i usually consists of either integers or
floating-point numbers. Each point in the database can be
represented by an n-tuple (x_1, . . . , x_n) of real numbers, where
n can be on the order of, for example, 100. A k nearest-
neighbors query consists of a query point q = (q_1, . . . , q_n) ∈ D^n
and an integer k representing the number of database points
that are to be returned as being near to the query point. The
query point may not necessarily be in the database. The
sense of nearest can be with respect to a Euclidean metric
or another l_p-metric. An (exact) output set O consists of k
points from the database such that

    ∀p' ∈ O and ∀p'' ∈ DB\O, ||p' - q|| ≤ ||p'' - q||.
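
The defining property of the exact output set can be illustrated with a
short brute-force sketch. It is not part of the described method; it
assumes a Euclidean metric and an in-memory list of points, and the
function name exact_knn is hypothetical.

```python
import heapq
import math

def exact_knn(db, q, k):
    """Return the k points of db closest to q (Euclidean metric assumed).

    This is a full linear scan: every database point costs one distance
    calculation, which is the cost the text is concerned with.
    """
    return heapq.nsmallest(k, db, key=lambda p: math.dist(p, q))

db = [(0.0, 0.0), (0.5, 0.5), (0.9, 0.1), (0.2, 0.1)]
q = (0.25, 0.0)
out = exact_knn(db, q, 2)
rest = [p for p in db if p not in out]
# Every reported point is at least as close to q as every unreported point.
assert all(math.dist(a, q) <= math.dist(b, q) for a in out for b in rest)
```

Any approximate method can be judged against this scan, either by overlap
with its output or by the ratio of the returned distances.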

If the database is large and a quick response is required,
a good approximate output is usually sought. The approxi-
mate output can be a set of points that overlaps the exact
output set O to a large extent, or a set of points having
distances to the query point that are not much larger than the
distances of the exact output set to the query point. Image
features, for example, are sometimes mapped into D^n within
an image database management system (DBMS). Image
similarity is determined based on a distance between fea-
tures in D^n. A similarity metric for the image DBMS can be
an approximation of a desired degree of similarity, so adding
a small approximation error associated with an approximate
output would not significantly affect the results.

A problem associated with a k nearest-neighbors query is 
how a DBMS application processes such a query so that a 
suitable approximate output is returned within a desired 
response time. The goodness of an approximate output 
depends, of course, on each application, and the response 
time depends on the processing needs of the DBMS, such as 
disk I/O and CPU time. All currently known methods for 
generating an output to a k nearest-neighbors query require 
calculation of distances between the query point and many 
database points. The computational effort is dominated by 
the number of distance calculations. Data points are fetched
from random locations in the database. When a database has a
high dimensionality, the database indexes that are used
cannot significantly restrict the number of points
that must be fetched. In many cases, a complete linear scan
of the database out-performs the currently known methods
for generating an appropriate output.

Various approaches have been tried for determining near 
neighbors in a database, such as by using bounding boxes or 
spheres for indexing multidimensional data, by using pro- 
jections for inducing ordering on database points, and by 
clustering data points. Most approaches are, nevertheless, 
limited by the dimensionality of the database. Results based
on databases having two and three dimensions can be quite
misleading when extrapolated to databases having higher
dimensions.
For a bounding box or a bounding sphere approach, many
hierarchical structures, such as the R-tree family, the hB-tree
family and the TV-tree, have been proposed for indexing
multidimensional data in database management systems that
collect data points into disk pages and compute bounds on
the points in the disk page. A bound is usually a minimal
bounding box that is parallel to the axes of the system;
however, bounding spheres (sphere trees) and convex poly-
hedra (cell-trees) have also been tried. Nevertheless, the
only structures that are used in practice are those of the
R-tree family. The collection of disk pages containing the
points is stored at the bottom level of the hierarchical
structure. The next level up is created by taking a collection
of bounds, e.g., the bounding boxes, and treating the col-
lection as data points. The data points are then collected into
groups, each fitting on a disk page, and a new bound is
computed for each group. The new bound is of the same type
as the bounds of a lower level. For example, if boxes are
used, then the upper levels use boxes as well. Levels are
created until the top level fits on one disk page.

A range query to a structure amounts to specifying a
region. Points in the region are determined by going down
the structure. If the region overlaps the bounds of a page,
then the subtree under that page may contain points that are
in the region. A nearest-neighbors query is processed as a
query about a small spherical region. If the result does not
contain enough points, then the region is enlarged by, for
example, doubling the radius of the sphere.
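
The radius-doubling strategy can be sketched as follows. The sketch
replaces the hierarchical index by a plain scan, so it shows only the
control flow; the function name and the starting radius r0 are
assumptions made for illustration.

```python
import math

def knn_by_growing_radius(points, q, k, r0=0.05):
    """Answer a k nearest-neighbors query as a sequence of range queries.

    A real system would answer each range query through the index; here a
    list comprehension stands in for it (an assumption of this sketch).
    """
    r = r0
    while True:
        hits = [p for p in points if math.dist(p, q) <= r]
        # Stop once the sphere holds k points (or the whole database).
        if len(hits) >= k or len(hits) == len(points):
            return sorted(hits, key=lambda p: math.dist(p, q))[:k]
        r *= 2  # enlarge the region by doubling the radius

pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
nearest = knn_by_growing_radius(pts, (0.0, 0.0), 2)
```

Once a sphere around the query point holds at least k points, the k
closest of them are the exact answer, because every point outside the
sphere is farther away than every point inside it.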

Conventional bounding approaches work well as long as
the indexing structure behaves well. That is, the complexity
of searching for points in a region should be proportional to
the volume of that region, assuming the region is convex and
"nice". This assumption may be justified only in databases
of very low dimension, that is, up to 4 dimensions. Careful
construction of a hierarchical structure may relax this con-
straint slightly, permitting searches in 6 or 7 dimensions. For
higher dimensions, though, conventional bounding
approaches perform far worse than a sequential scan of the
entire data set.

A one-dimensional projection approach induces an order-
ing on a set of database points so that a projection of a query
point within the ordering can be quickly located. Projections
are continuous, so close points have close projections. On
the other hand, distant points may also have close projec-
tions. The higher the dimensionality of the database, the
more severe the problem of distant points having close
projections becomes. That is, for a high dimensional query,
when candidates for nearest neighbors are selected from
among the points having projections that are close to the
projection of the query, the number of candidates becomes
large. Good nearest neighbor candidates should have many
close projections. Nevertheless, the problem of determining
good candidates is closely related to the nearest neighbors
problem itself. An interesting theoretical result is reported
by D. P. Huttenlocher and J. M. Kleinberg, "Comparing
point sets under projection," in Proceedings of the Fifth
Annual ACM-SIAM Symposium on Discrete Algorithms
(1994), pp. 1-7. The difference between orderings based on
one-dimensional projections and orderings based on space
filling curves is that, in the latter, proximity in the ordering
implies proximity in the space, so if there are sufficiently
many orderings, it suffices to consider candidates that are
close in at least one of the orderings.

Clustering the database points into clusters reflecting
proximity is believed to help in the search for near neigh-
bors. Each cluster is represented by either a database point
in the cluster or by the centroid of the cluster. The cluster
having a representative point that is closest to the query
point is searched first. Other clusters are searched in order of
proximity of their representative points to the query point or
based on bounds that are derived in various ways. The search
is expected to end without checking all the database points.
As with other conventional approaches, clustering
approaches may break down in a high dimensional database
because it may not be possible to identify sufficiently many
clusters as being distant.

What is needed is a technique for providing a fast
response to a k nearest-neighbors query.

SUMMARY OF THE INVENTION

The present invention provides a technique for providing
a fast response to a k nearest-neighbors query. The advan-
tages of the present invention are provided by a method for
determining k nearest-neighbors to a query point in a
database in which an ordering is defined for a data set P of
a database, the ordering being based on l one-dimensional
codes C_1, . . . , C_l. Preferably, the step of defining an
ordering for a data set P of a database forms a plurality of
cubes. A single relation R is created in which R has the
attributes of index-id, point-id and value. An entry
(j, i, C_{ε_j}(p_i)) is included in relation R for each data point
p_i ∈ P, where index-id equals j, point-id equals i, and value
equals C_{ε_j}(p_i). A B-tree index is created based on a
combination of the index-id attribute and the value attribute.
A query point is received and a relation Q is created for the
query point having the attributes of index-id and value. One
tuple is generated in the relation Q for each j, j = 1, . . . , l,
where index-id equals j and value equals C_{ε_j}(q). A distance
d is selected. The index-id attribute for the relation R of each
data point p_i is compared to the index-id attribute for the
relation Q of the query point. A candidate data point p_i is
selected when the comparison of the relation R of a data
point p_i to the index-id attribute for the relation Q of the
query point is less than the distance d. Lower bounds are
calculated for each cube of the plurality of cubes that
represent a minimum distance between any point in a cube
and the query point. The step of comparing is terminated
when no lower bound is less than a distance between the
query point and any of the candidate data points. Lastly, k
candidate data points p_i are selected as k nearest-neighbors
to the query point.

BRIEF DESCRIPTION OF THE DRAWING

The present invention is illustrated by way of example
and not limitation in the accompanying sole FIGURE which
shows a method for determining k nearest-neighbors to a
query point in a database according to the present invention.

DETAILED DESCRIPTION

The present invention provides a fast response to a k
nearest-neighbors query by using a spatial indexing tech-
nique. Database points are sorted into n different linear
orders at the time a database is populated. An identification
for each database point is placed into n different prefix-
compressed B-trees each corresponding to a respectively
different linear order. Each linear order is based on a
so-called space-filling curve, so that for any point in the
space, a code of the point determines the position of the
point in the order. Preferably, the code of the point is a single
real number. Each tree also contains information that deter-
mines the order relations between the database points. The
space-filling curve has a good locality property in the sense
that two points that are close in the order are close in the
space. Nevertheless, two points that are close in the space
may not be as close within the order. For good retrieval
performance, it suffices that points that are close in the space,
but not close within a particular order, are close in at least one
order. The present invention forms the linear orders in a way
that has a high likelihood for the points to be close in at least
one order.

The orders are based on curves that deal with the different
dimensions of the database in a random order. Additionally,
the subdivision of the space defining the curve is shifted
randomly, so that there is a high probability that in at least
one of the orders, a query point will not be close to a
pathological region where points close in the space have
distant codes. The B-tree data structure of the present
invention is easily updated by adding and/or deleting data
points. The underlying orders represented by the various
B-trees are data independent, that is, the curves are fixed in
advance, so inserting or deleting a point takes O(log N) time
per tree, where N is the number of points in the database.

At query time, the position of the query point in each
order is determined by a technique that generalizes a Gray
code calculation, but stops the calculation of bits as soon as
the position of the query point in the order relative to the
database points has been determined. A set of candidate near
neighbors is then extracted from each B-tree by simply
selecting the identification of the points that are within, for
example, 30 positions from the query point in the tree. The
set of candidate identifications contains many duplicates, so
the duplicate points are eliminated. The number of points
from each B-tree remaining after eliminating duplicate
points may then need to be increased so the number of
candidates is at least k.

Each tree stores not only the identification of particular
points, but also additional information in the form of bits
that determine the order among database points represented
by the tree. For each point identification, the bits provide
lower and upper bounds on the distance of the data point
from a query point, and are useful for deciding whether a
particular candidate should be fetched. After several candi-
date points have been fetched and their respective distances
to the query point have been evaluated, the bits stored in the
trees may indicate that certain candidate points do not need
to be fetched because their distances to the query point are
greater than the distances of many other points that have
already been fetched, that is, their distances to the
query point are greater than an upper bound. Distance
calculation at this stage can be done with respect to any
distance measure that is correlated with a location in a
Euclidean space.

The k nearest points among the candidates are then
reported. The remaining unreported candidates, as well as
the position of the query point within each respective tree, are
stored for a potential subsequent request for additional
points close to the query point. The output can then be easily
incremented by fetching additional candidates rather than
restarting the search based on a different value of k.

The spatial indexing technique of the present invention
used when the database is populated is based on a family of
linear orderings of a data space D^n. A canonical ordering or
encoding based on a one-dimensional code C: D^n → D is first
described. Other orderings based on the canonical ordering can
also be used.

In order to describe the calculation of C(v) for v ∈ D^n, a
convenient representation of v is used in which
v = (v_n, . . . , v_1), where v_j = Σ_{i=1}^{m} v_j^i 2^{-i}
(v_j^i ∈ {0,1}, j = 1, . . . , n, i = 1, . . . , m). Thus,
v = Σ_{i=1}^{m} 2^{-i} v^i, where v^i = (v_n^i, v_{n-1}^i, . . . ,
v_1^i) (i = 1, . . . , m). According to the invention, the encoding
of a database point is recursively defined in terms of m based
on a code J: {0,1}^n → {0, . . . , 2^n - 1} that is essentially the
inverse of a Gray code, and in which affine transformations
A_I of R^n map boolean vectors to boolean vectors for
I = 0, . . . , 2^n - 1.

The code J defines a linear ordering on the vertices of a
unit cube in R^n so that consecutive vertices in the linear
ordering are also adjacent in the cube. The code J(x), for
every x = (x_n, . . . , x_1) (x_i ∈ {0,1}, i = 1, . . . , n), is
represented by

    J(x) = Σ_{i=1}^{n} y_i 2^{i-1}

where y_i ∈ {0,1}. Thus, 0 ≤ J(x) < 2^n. The following exemplary
pseudocode illustrates an Algorithm J for calculating code
J(x):

Algorithm J; input: (x_n, . . . , x_1); output: (y_n, . . . , y_1)

1. y_n := x_n

2. for i = n-1 downto 1 do y_i := x_i ⊕ y_{i+1}.

It follows that J(0, . . . , 0) = 0, so (0, . . . , 0) is the first
member in this ordering. Similarly, J(1,0, . . . , 0) = 2^n - 1, so
(1,0, . . . , 0) is the last member in the ordering. Note that
these two vertices are also adjacent. Further, based on the
definition of J,

    J(1, x_{n-1}, . . . , x_1) = 2^n - 1 - J(0, x_{n-1}, . . . , x_1).

A Gray code G(I) that is the inverse of J can alternatively
be used for mapping an integer I = Σ_{i=1}^{n} y_i 2^{i-1} to a vector
x = (x_n, . . . , x_1) (x_i ∈ {0,1}, i = 1, . . . , n). The following
exemplary pseudocode illustrates an Algorithm G for cal-
culating a Gray code G(I):

Algorithm G; input: (y_n, . . . , y_1); output: (x_n, . . . , x_1)

1. x_n := y_n

2. for i = n-1 downto 1 do x_i := y_i ⊕ y_{i+1}.

Based on the definition of G:

    For I ≥ 2^{n-1}, G(I) = 2^{n-1} + G(2^n - 1 - I).
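
Algorithms J and G can be tried out with a small sketch. It assumes bit
vectors are held as Python lists written most significant bit first, as
(x_n, . . . , x_1), and it reads both XOR steps as combining a bit with
the next-higher bit of the position (y_{i+1}); the helper names are
hypothetical.

```python
def to_bits(I, n):
    """Integer -> bit list (y_n, ..., y_1), most significant bit first."""
    return [(I >> (n - 1 - k)) & 1 for k in range(n)]

def from_bits(bits):
    """Bit list (y_n, ..., y_1) -> integer."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value

def algorithm_j(x):
    """J: cube vertex (x_n, ..., x_1) -> position bits (y_n, ..., y_1)."""
    y = [x[0]]                     # y_n := x_n
    for xi in x[1:]:               # i = n-1 downto 1
        y.append(xi ^ y[-1])       # y_i := x_i XOR y_(i+1)
    return y

def algorithm_g(y):
    """G: position bits (y_n, ..., y_1) -> cube vertex (x_n, ..., x_1)."""
    x = [y[0]]                     # x_n := y_n
    for k in range(1, len(y)):
        x.append(y[k] ^ y[k - 1])  # x_i := y_i XOR y_(i+1)
    return x

n = 4
vertices = [algorithm_g(to_bits(I, n)) for I in range(2 ** n)]
# J undoes G, and consecutive vertices differ in exactly one coordinate.
assert all(from_bits(algorithm_j(v)) == I for I, v in enumerate(vertices))
assert all(sum(a != b for a, b in zip(u, v)) == 1
           for u, v in zip(vertices, vertices[1:]))
```

With integers instead of lists, the same Gray code is the familiar
I XOR (I >> 1).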

Using the recursive algorithms for J and G, the mappings
J and G are the inverse functions of each other. To show this,
consider the example of an order relation between any two
points p = (p_n, . . . , p_1) and q = (q_n, . . . , q_1) in [0,1)^n
that is desired to be determined. The first phase of the determina-
tion is based on the code J. In order to have a unique binary
representation for any number, one of two equivalent tail
expansions, 1000 . . . and 0111 . . . , is selected. Let
p_j^1 = ⌊2p_j⌋ and q_j^1 = ⌊2q_j⌋. The tail expansions are
obtained, for example, by discarding all the bits defining the data
points under consideration except for the first bit of each point.

To compare p and q, J(p_n^1, . . . , p_1^1) and J(q_n^1, . . . , q_1^1)
are first compared. If J(p_n^1, . . . , p_1^1) and
J(q_n^1, . . . , q_1^1) are distinct, then the comparison has been
decided. Otherwise, a comparison is performed that is based on the
second bits of each point under consideration, and so on. In this
situation, though, J is not used directly. Because the ordering
reflects distance, the points are transformed using an affine
transformation described below. When the points are
selected independently from a uniform distribution, the
comparison is resolved in the first phase of the determination
with probability 1 - 2^{-n}, so the expected number of passes for
resolving the comparison is 1 + 1/(2^n - 1).
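
The expected number of passes can be checked with a short calculation.
It assumes, beyond what the text states, that every pass is resolved
independently with the same probability 1 - 2^{-n}, so the number of
passes is geometrically distributed.

```latex
\mathbb{E}[\text{passes}]
  = \sum_{t \ge 1} t \,\bigl(2^{-n}\bigr)^{t-1}\bigl(1 - 2^{-n}\bigr)
  = \frac{1}{1 - 2^{-n}}
  = \frac{2^{n}}{2^{n} - 1}
  = 1 + \frac{1}{2^{n} - 1}.
```

For n = 2 this gives 4/3 passes on average, and for large n it is
practically 1.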

Code J induces an ordering on the 2^n sub-cubes of a unit
cube that are obtained by cutting the unit cube with the
hyperplanes {x_i = 0.5} (i = 1, . . . , n). More specifically, for
every a = (a_n, . . . , a_1) ∈ {0,1}^n, let

    S(a) = {x : 0 ≤ x_i < 1/2 if a_i = 0, 1/2 ≤ x_i < 1 if a_i = 1}.

Sub-cubes S(a) are disjoint and ordered by the function J,
that is, S(a) comes before S(a') if and only if J(a) < J(a').
When p and q belong to distinct sub-cubes, p and q inherit
their ordering relation from the ordering on the sub-cubes.
When p and q are in the same S(a), the sub-cubes of S(a)
must be considered. It is crucial that the orderings on the
sub-cubes of the S(a)'s, except the last sub-cube, are such
that for every a, the last sub-cube of S(a) is adjacent to the
first sub-cube of S(a'), where S(a') is the sub-cube that
succeeds S(a) in the ordering over the sub-cubes of the unit
cube. When this condition is not satisfied, there are two
distant points in the cube that are arbitrarily close in the
linear order. In order for the indexing scheme of the present
invention to be efficient, this should never happen.
Thus, the ordering on the sub-cubes of each S(a) is deter-
mined by first applying a suitable affine transformation, and
then by applying the function J.

For each I ∈ {0, . . . , 2^n - 1}, an affine transformation A_I of
R^n maps the set {0,1}^n onto itself, and requires that C be
defined with respect to all types of cubes I_1 × . . . × I_n, where
each I_i is either [0,1) or (0,1], and with respect to all the
permutations of the coordinates. For simplicity of notation,
however, the following is preferred:

    C(v) = 2^{-n} [J(v^1) + C(A_{J(v^1)}(2v - v^1))].

It is essential that the transformations can be computed in
a time that is proportional to the dimension of the space. The
affine transformation A_I is composed of a "reflection" com-
ponent and a "swap" component as:

    A_I(v) = P(I)(s(I) ⊕ v).

The reflection component is a vector s = s(I) ∈ {0,1}^n, while
the swap component is a permutation matrix P = P(I) of order
n×n that swaps coordinate n with some coordinate i = i(I). For
example, if i ≠ n, then the only difference between P and an
n×n identity matrix is that P_in = P_ni = 1 and P_ii = P_nn = 0.
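
The action of a single transformation can be sketched as below. The
sketch assumes the reading "reflect the coordinates selected by s, then
swap coordinate n with coordinate i", with vectors listed in the order
(x_n, . . . , x_1); the function name and the sample values of s and i
are illustrative only.

```python
def affine(v, s, i):
    """Apply A(v) = P(s XOR v) to v = (x_n, ..., x_1).

    s is a 0/1 tuple in the same order as v; a 1 reflects that coordinate
    (x -> 1 - x, which for a bit is the same as XOR with 1).
    i is the coordinate (1-based) that trades places with coordinate n.
    """
    n = len(v)
    w = [1 - x if bit else x for x, bit in zip(v, s)]
    pos_n, pos_i = 0, n - i        # list positions of coordinates n and i
    w[pos_n], w[pos_i] = w[pos_i], w[pos_n]
    return tuple(w)

x3, x2, x1 = 0.25, 0.5, 0.125
# s = 011, i = 3: reflect coordinates 2 and 1, swap coordinate 3 with itself.
a = affine((x3, x2, x1), (0, 1, 1), 3)
# s = 110, i = 2: reflect coordinates 3 and 2, then swap coordinates 3 and 2.
b = affine((x3, x2, x1), (1, 1, 0), 2)
```

Both steps touch each coordinate at most once, which is consistent with
the requirement that a transformation be computable in time proportional
to the dimension.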

The following exemplary pseudocode illustrates an Algo-
rithm s for calculating the reflection component s(I), where
e_1 = (0, . . . , 0, 1):

Algorithm s; input: I; output: s = s(I)

if I = 0 then s := 0

else if I ≡ 1 (mod 2) then s := G(I-1)

else if I ≡ 2 (mod 4) then s := G(I-1) - e_1

else if I ≡ 0 (mod 4) then s := G(I-1) + e_1.

It follows from the definition of the Gray code that if k ≡ 1
(mod 4), then G(k) = G(k-1) + e_1, in which case s(k+1) = G(k) -
e_1 = G(k-1). When k ≡ 3 (mod 4), then G(k) = G(k-1) - e_1, in
which case again s(k+1) = G(k) + e_1 = G(k-1). For even k, by
definition s(k+1) = G(k).

The swap component P = P(I) is determined by the index
i = i(I) of the coordinate that is swapped under P with
coordinate n. Indices i(0) and i(2^n - 1) are defined to be 1.
For I + 1 = Σ_{j=1}^{n} x_j 2^{j-1}, such that 0 < I < 2^n - 1, i(I)
is the largest index j ≥ 2 such that x_k = 0 for all 2 ≤ k < j. The
following exemplary pseudocode illustrates an Algorithm i for
calculating an index:

Algorithm i; input: I; output: i = i(I)

if I = 0 or I = 2^n - 1, then i := 1 else

1. i := 2

2. I := ⌊(I+1)/2⌋

3. while I ≡ 0 (mod 2) do

    (a) i := i+1

    (b) I := ⌊I/2⌋.

Thus, for even k > 0, i(k) = i(k-1). In step 2 of Algorithm i,
I is converted to ⌊(k+1)/2⌋ (for input k), and to ⌊k/2⌋ (for
input k-1). When k > 0 is even, then ⌊(k+1)/2⌋ = ⌊k/2⌋, so the
result of Algorithm i is the same for k and k-1.
Consequently, for 0 < k < 2^n - 1, G(k+1) and G(k-1) differ in
coordinates 1 and i(k).
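
Algorithm s and Algorithm i translate almost line for line into code.
The sketch below assumes bit vectors are packed into integers with
coordinate j stored in bit j-1, so that e_1 is the integer 1; the
function names are hypothetical.

```python
def gray(I):
    """Reflected Gray code G(I), packed as an integer."""
    return I ^ (I >> 1)

def reflection(I):
    """Algorithm s: reflection component s(I) as a packed bit vector."""
    if I == 0:
        return 0
    if I % 2 == 1:
        return gray(I - 1)
    if I % 4 == 2:
        return gray(I - 1) - 1     # G(I-1) - e_1
    return gray(I - 1) + 1         # G(I-1) + e_1

def swap_index(I, n):
    """Algorithm i: the coordinate that is swapped with coordinate n."""
    if I == 0 or I == 2 ** n - 1:
        return 1
    i = 2
    I = (I + 1) // 2
    while I % 2 == 0:
        i += 1
        I //= 2
    return i

n = 3
s_values = [reflection(I) for I in range(2 ** n)]
i_values = [swap_index(I, n) for I in range(2 ** n)]
# G(k+1) and G(k-1) differ exactly in coordinates 1 and i(k).
for k in range(1, 2 ** n - 1):
    assert gray(k + 1) ^ gray(k - 1) == 1 | (1 << (swap_index(k, n) - 1))
```

For n = 3 the lists come out as s = 000, 000, 000, 011, 011, 110, 110,
101 and i = 1, 2, 2, 3, 3, 2, 2, 1.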

Affine transformations for the case n=2 are summarized in
Table I:

TABLE I

I    G(I-1)   Δ     s(I)   i(I)   A_I(x_2, x_1)
00                  00     1      (x_1, x_2)
01   00             00     2      (x_2, x_1)
10   01       -01   00     2      (x_2, x_1)
11   11             11     1      (1 - x_1, 1 - x_2)



Affine transformations for the case n=3 are summarized in
Table II:

TABLE II

I     G(I-1)   Δ      s(I)   i(I)   A_I(x_3, x_2, x_1)
000                   000    1      (x_1, x_2, x_3)
001   000             000    2      (x_2, x_3, x_1)
010   001      -001   000    2      (x_2, x_3, x_1)
011   011             011    3      (x_3, 1 - x_2, 1 - x_1)
100   010      +001   011    3      (x_3, 1 - x_2, 1 - x_1)
101   110             110    2      (1 - x_2, 1 - x_3, x_1)
110   111      -001   110    2      (1 - x_2, 1 - x_3, x_1)
111   101             101    1      (1 - x_1, x_2, 1 - x_3)



Affine transformations for the case n=4 are summarized in
Table III:

TABLE III

I      G(I-1)   Δ       s(I)   i(I)   A_I(x_4, x_3, x_2, x_1)
0000                    0000   1      (x_1, x_3, x_2, x_4)
0001   0000             0000   2      (x_2, x_3, x_4, x_1)
0010   0001     -0001   0000   2      (x_2, x_3, x_4, x_1)
0011   0011             0011   3      (x_3, x_4, 1 - x_2, 1 - x_1)
0100   0010     +0001   0011   3      (x_3, x_4, 1 - x_2, 1 - x_1)
0101   0110             0110   2      (1 - x_2, 1 - x_3, x_4, x_1)
0110   0111     -0001   0110   2      (1 - x_2, 1 - x_3, x_4, x_1)
0111   0101             0101   4      (x_4, 1 - x_3, x_2, 1 - x_1)
1000   0100     +0001   0101   4      (x_4, 1 - x_3, x_2, 1 - x_1)
1001   1100             1100   2      (x_2, 1 - x_3, 1 - x_4, x_1)
1010   1101     -0001   1100   2      (x_2, 1 - x_3, 1 - x_4, x_1)
1011   1111             1111   3      (1 - x_3, 1 - x_4, 1 - x_2, 1 - x_1)
1100   1110     +0001   1111   3      (1 - x_3, 1 - x_4, 1 - x_2, 1 - x_1)
1101   1010             1010   2      (1 - x_2, x_3, 1 - x_4, x_1)
1110   1011     -0001   1010   2      (1 - x_2, x_3, 1 - x_4, x_1)
1111   1001             1001   1      (1 - x_1, x_3, x_2, 1 - x_4)
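
The definitions given so far (the code J, the reflection s(I), the swap
index i(I) and the recursive definition of C) are enough for a compact
sketch of the encoding on a grid of 2^m cells per dimension. The sketch
makes several assumptions of its own: bit vectors are packed into
integers with coordinate j in bit j-1, a reflection of a cell is taken
to be the complement of its remaining bits, and each transformation is
applied directly to all deeper bit vectors, which is simple but
quadratic in m. All function names are hypothetical.

```python
def gray(I):
    return I ^ (I >> 1)

def inverse_gray(x):
    """The code J on a packed bit vector."""
    y = 0
    while x:
        y ^= x
        x >>= 1
    return y

def reflection(I):
    if I == 0:
        return 0
    if I % 2 == 1:
        return gray(I - 1)
    return gray(I - 1) - 1 if I % 4 == 2 else gray(I - 1) + 1

def swap_index(I, n):
    if I == 0 or I == 2 ** n - 1:
        return 1
    i, I = 2, (I + 1) // 2
    while I % 2 == 0:
        i, I = i + 1, I // 2
    return i

def apply_affine(bits, I, n):
    """A_I on a packed bit vector: reflect by s(I), then swap n and i(I)."""
    w = bits ^ reflection(I)
    hi, lo = n - 1, swap_index(I, n) - 1
    if ((w >> hi) ^ (w >> lo)) & 1:      # the two bits differ: exchange them
        w ^= (1 << hi) | (1 << lo)
    return w

def encode(cell, n, m):
    """Integer code in [0, 2**(n*m)) of a grid cell; C = code / 2**(n*m).

    cell[j-1] is the cell index along coordinate j, in [0, 2**m).
    """
    # levels[t] packs the (t+1)-th most significant bit of every coordinate.
    levels = [sum(((cell[j] >> (m - 1 - t)) & 1) << j for j in range(n))
              for t in range(m)]
    code = 0
    for t in range(m):
        I = inverse_gray(levels[t])
        code = (code << n) | I
        for u in range(t + 1, m):        # transform everything below level t
            levels[u] = apply_affine(levels[u], I, n)
    return code

order = sorted(((x1, x2) for x1 in range(4) for x2 in range(4)),
               key=lambda c: encode(c, 2, 2))
```

For n = 2 the resulting order is the familiar Hilbert-style curve: it
starts in cell (0, 0), ends in cell (0, 3), and moves to a neighboring
cell at every step.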




The following exemplary pseudocode illustrates an Algo-
rithm C that is used for calculating the code C(v) of a given
point v = (v_n, . . . , v_1), where each v_j is an m-bit number,
v_j = Σ_{i=1}^{m} v_j^i 2^{-i} (v_j^i ∈ {0,1}, j = 1, . . . , n,
i = 1, . . . , m). Algorithm C produces the representation
C(v) = Σ_{i=1}^{m} I_i 2^{-in} (0 ≤ I_i < 2^n, i = 1, . . . , m),
which also produces the representation v = Σ_{i=1}^{m} 2^{-i} v^i,
where v^i = (v_n^i, v_{n-1}^i, . . . , v_1^i) (i = 1, . . . , m). All
the operations on vectors in Algorithm C are component-wise.

Algorithm C; input: v = (v_n, . . . , v_1); output:
C = C(v) = Σ_i I_i 2^{-in}

1. i := 1; s := 0; P := I

2. while v ≠ 0 do

    2.1 Convert v into a bit vector and a remainder:

        (a) v' := ⌊2v⌋

        (b) v := 2v - v'

    2.2 Compute I_i:

        I_i := J(P·(s ⊕ v'))

    2.3 Compute the new transformation:

        (a) s := s ⊕ (P^{-1}·s(I_i))

        (b) P := P·P(I_i)

    2.4 Increment i:

        i := i+1.

The mapping C is adequate for unambiguous coding of
n-dimensional data, that is, the mapping C is one-to-one. As
proof, suppose p and q are distinct points in [0,1)^n. There
exist unique representations p = Σ_{i=1}^{∞} p^i 2^{-i} and
q = Σ_{i=1}^{∞} q^i 2^{-i}, where {p^i}, {q^i} ⊂ {0,1}^n
(i = 1,2, . . . ), and for every natural N and every j (1 ≤ j ≤ n),
there exist k,l > N such that p_j^k = q_j^l = 0. Let i (i ≥ 1) be the
smallest index such that p^i ≠ q^i. It follows that in the hierarchy
of the sub-cubes of the unit cube that are obtained by repeatedly
halving all the dimensions, there exists a sub-cube S_0 of the unit
cube having edges of length 2^{-i+1}, such that {p,q} ⊂ S_0, whereas
there exist two disjoint sub-cubes S_1, S_2 of S_0, having edges
of length 2^{-i}, such that p ∈ S_1 and q ∈ S_2. The points of S_1 and
S_2 are mapped under C into two disjoint intervals of [0,1)
each of length 2^{-in}. Thus, C(p) ≠ C(q).

For convenience, the following notation is introduced. For
m = 1,2, . . . ,

1. For k_1, . . . , k_n such that 0 ≤ k_j ≤ 2^m - 1 (j = 1, . . . , n),
the cube S(k_1, . . . , k_n; m) is defined to be equal to

    {x ∈ R^n : k_j 2^{-m} ≤ x_j < (k_j + 1) 2^{-m}, j = 1, . . . , n}.

2. S(m) denotes the collection of all the cubes
S(k_1, . . . , k_n; m).

3. For every S ∈ S(m), the members of S(m+1) that are
contained in S are denoted by S(m+1;S). Also, if
S ∈ S(m) is not the sub-cube that is mapped under C into
[1 - 2^{-mn}, 1), then the member of S(m) that succeeds S in
the ordering induced by C is denoted by S'. For
example, if S is mapped into [k2^{-mn}, (k+1)2^{-mn}), then S'
is mapped into [(k+1)2^{-mn}, (k+2)2^{-mn}).

Note that for each m, the members of S(m) are mapped
under C into 2^{mn} disjoint (half open) subintervals of [0,1),
each of length 2^{-mn}. For m = 1,2, . . . , and for every S ∈ S(m),
except for the last S relative to the ordering induced by C,
the last member of S(m+1;S) and the first member of
S(m+1;S') are adjacent sub-cubes, that is, they have a
common facet. As proof, first consider the case of m=1.
Suppose S ∈ S(1) is mapped into the interval

    T(S) = [k2^{-n}, (k+1)2^{-n})

where 0 ≤ k ≤ 2^n - 2. Thus, the last member of S(2;S) is
mapped into the interval

    T_L = [(k+1)2^{-n} - 2^{-2n}, (k+1)2^{-n}),

while the first member of S(2;S') is mapped into

    T_R = [(k+1)2^{-n}, (k+1)2^{-n} + 2^{-2n}).

If a point v_R = (1/2)v_R^1 + (1/4)v_R^2 + ε_R (where
v_R^i ∈ {0,1}^n, i = 1,2, and ε_R ∈ [0, 1/4)^n) has C(v_R) ∈ T_R,
then the first two values that Algorithm C calculates for point
v_R are I_1 = k+1 and I_2 = 0. Thus, necessarily,

    J(v_R^1) = k+1

which is the same as
    v_R^1 = G(k+1).

Now, the first values of the other objects computed by
Algorithm C are

    s_R = s(k+1),

and

    P_R = P(k+1).

Thus, the other necessary condition for
v_R = (1/2)v_R^1 + (1/4)v_R^2 + ε_R to be mapped into T_R is:

    0 = J(P(k+1)(s(k+1) ⊕ v_R^2)),

which, because G(0)=0, is the same as s(k+1) ⊕ v_R^2 = 0, or is
simply

    v_R^2 = s(k+1).

Similarly, if a point v_L = (1/2)v_L^1 + (1/4)v_L^2 + ε_L has
C(v_L) ∈ T_L, then the first two values that Algorithm C calculates
for v_L are I_1 = k and I_2 = 2^n - 1. Thus, necessarily,

    J(v_L^1) = k

which is the same as

    v_L^1 = G(k).

Because the first values of the other objects computed by
Algorithm C are s_L = s(k) and P_L = P(k), the other necessary
condition for v_L = (1/2)v_L^1 + (1/4)v_L^2 + ε_L to be mapped into
T_L is:

    2^n - 1 = J(P_L(s_L ⊕ v_L^2)) = J(P(k)(s(k) ⊕ v_L^2)).

Because G(2^n - 1) = e_n, this is equivalent to

    v_L^2 = s(k) ⊕ (P(k)e_n) = s(k) ⊕ e_{i(k)},

where e_j denotes a unit vector with 1 in coordinate j. The fact
that v_L^1 = G(k) and v_R^1 = G(k+1) implies that points mapped
into T_L and T_R must respectively belong to adjacent mem-
bers of S(1), but in order to prove that they belong to
adjacent members of S(2), v_L^2 and v_R^2 must be relied on. In
fact, it suffices to prove that v_L^2 and v_R^2 always differ in one
coordinate.

For odd k, s(k+1) = G(k-1), while for even k, s(k+1) = G(k).
Thus, for odd k, s(k+1) = s(k). When k is odd, then
v_R^2 = s(k+1) = s(k) and v_L^2 = s(k) ⊕ e_{i(k)}, so v_R^2 and
v_L^2 differ in precisely one coordinate, namely, i(k). Further,
when k is even, v_R^2 = s(k+1) = G(k) and v_L^2 = s(k) ⊕ e_{i(k)}, so:

(a) if k=0, then v_R^2 = 0 and v_L^2 = e_1, so v_R^2 and v_L^2 differ
in precisely one coordinate;

(b) if k ≡ 0 (mod 4) and k > 0, then s(k) = G(k-1) + e_1, so

    v_L^2 = (G(k-1) + e_1) ⊕ e_{i(k)} = G(k-2) ⊕ e_{i(k)},

whereas v_R^2 = G(k). But G(k) and G(k-2) differ only in
coordinates 1 and i(k-1). Also, i(k) = i(k-1). Thus, v_R^2 and
v_L^2 differ in only one coordinate, namely, coordinate 1; and

(c) if k ≡ 2 (mod 4) and k > 0, then s(k) = G(k-1) - e_1, so

    v_L^2 = (G(k-1) - e_1) ⊕ e_{i(k)} = G(k-2) ⊕ e_{i(k)}.

Thus, as in the previous case, v_R^2 and v_L^2 differ in only
one coordinate.

Finally, given S ∈ S(m), where m > 1, R denotes the member
of S(1) that contains S.

Thus, for every two points p,q in [0,1)^n,

    ||p - q||_∞ ≤ 2|C(p) - C(q)|^{1/n}.

As proof, consider p,q ∈ [0,1)^n, and denote by m the number
such that

    2^{-mn} ≤ |C(p) - C(q)| < 2^{-(m-1)n}.

Without loss of generality, suppose C(q) < C(p). Because
2^{-mn} ≤ C(p) - C(q), and because each member of S(m) is
mapped by C into a half open interval of length 2^{-mn}, points
p and q cannot belong to the same member of S(m). On the
other hand, because C(p) - C(q) < 2^{-(m-1)n}, there exists a k,
1 ≤ k ≤ 2^{(m-1)n} - 1, such that

    (k-1)2^{-(m-1)n} ≤ C(q) < C(p) < (k+1)2^{-(m-1)n}.

If either C(p) < k2^{-(m-1)n} or C(q) ≥ k2^{-(m-1)n}, then both p
and q belong to the same member of S(m-1) and, hence,

    ||p - q||_∞ < 2^{-(m-1)} = 2(2^{-mn})^{1/n} ≤ 2|C(p) - C(q)|^{1/n}.

Otherwise,

    C(q) < k2^{-(m-1)n} ≤ C(p).

In this case, p and q belong to distinct members of S(m-1)
that are mapped under C into consecutive intervals
[(k-1)2^{-(m-1)n}, k2^{-(m-1)n}) and [k2^{-(m-1)n}, (k+1)2^{-(m-1)n}).
These members of S(m-1) must be adjacent [...].

According to the invention, the position of the query point
in each order is determined by using a generalized Gray code
calculation. The calculation of bits is stopped as soon as the
position of the query point in the order relative to the
database points has been determined. For example, consider
a search based on a single one-dimensional ordering or
encoding. Suppose C: [0,1)^n → [0,1) is a mapping having the
following characteristics: (i) C is one-to-one, and (ii) the
inverse mapping C^{-1}: I → [0,1)^n is continuous, where I ⊆ [0,1)
is the image of C. Because C is one-to-one into an interval,
C defines an order on [0,1)^n:

    x ≤ y ⇔ C(x) ≤ C(y).

Let

    P = {p_1, . . . , p_N} ⊂ [0,1)^n.

A linear ordering can be implemented as a simple relation
R with only two attributes: a point-id and a value. Each p_i
has one entry in R where point-id equals i and value equals
C(p_i).

A query consists of two components: a point q ∈ [0,1)^n and
a number k ∈ N. An exact answer to the query consists of a set
O ⊂ P of size k so that

    ∀x ∈ O ∀y ∈ (P\O), ||x - q|| ≤ ||y - q||.

The relation R is used for obtaining a set of candidates from
P. For every δ > 0 define R(δ;q) as

    R(δ;q) = {p ∈ P : |C(p) - C(q)| ≤ δ}.

Because C^{-1} is continuous, the points in R(δ;q) are likely to be
close in the space [0,1)^n. A good implementation for one
such index in a conventional DBMS is to form a B-tree
index on attribute value.

The definition of C lends itself to a multitude of other
possible one-dimensional orderings or encodings because
the dimensions 1, . . . , n can be arbitrarily permutated and



each permutation defines a different ordering. Sub-cubes that are
distant relative to one permutation may be close relative to another
one. The sub-cubes hierarchy itself, however, is the same for all the
permutations. In order to obtain further orderings that do not share
the same hierarchy, a new ordering is defined that is based on C by
using a translation vector $\epsilon \in [0, 1/3)^n$. A mapping
$C_\epsilon : [0,1)^n \to [0,1)$ is defined by

$$C_\epsilon(x) = C\left(\tfrac{3}{4}(x + \epsilon)\right).$$

For each $\epsilon$, a candidate set is defined by

$$R_\epsilon(\delta; q) = \{\, p \in P : |C_\epsilon(p) - C_\epsilon(q)| \le \delta \,\}.$$

A fixed number of orderings are used, using a positive parameter
$\delta$ as follows:

1. Select vectors $\epsilon_1, \ldots, \epsilon_l$.

2. For each ordering, define a relation $R_i$, similar to R, based on
$C_{\epsilon_i}$ instead of C.

3. For a query (q,k), return a candidate set

$$R(\delta; q) = \bigcup_{i=1}^{l} R_{\epsilon_i}(\delta; q). \qquad (2)$$

4. For each point in $R(\delta; q)$, calculate its distance from q and
finally return the k closest points.

The sole FIGURE shows a process 10 for determining k nearest-neighbors
to a query point in a database according to the present invention.
Consider a query according to the present invention in a conventional
DBMS. At step 11, an ordering is defined for a data set P of a
database. The ordering is based on $l$ one-dimensional codes
$C_{\epsilon_1}, \ldots, C_{\epsilon_l}$. Preferably, the step of
defining an ordering for a data set P of a database forms a plurality
of cubes. At step 12, a single relation R is created in which R has the
attributes of index-id, point-id and value. At step 13, an entry
$(j, i, C_{\epsilon_j}(p_i))$ is included in relation R for each data
point $p_i \in P$, where index-id equals j, point-id equals i, and
value equals $C_{\epsilon_j}(p_i)$. A B-tree index is created at step
14 that is based on a combination of the index-id attribute and the
value attribute.

A query point is received at step 15 and at step 16 a relation Q is
created for the query point having the attributes of index-id and
value. At step 17, one tuple is generated in the relation Q for each j,
$j = 1, \ldots, l$, where index-id equals j and value equals
$C_{\epsilon_j}(q)$. A distance d is selected at step 18. At step 19,
the index-id attribute for the relation R of each data point $p_i$ is
compared to the index-id attribute for the relation Q of the query
point. A candidate data point $p_i$ is selected at step 20 when the
comparison of the relation R of a data point $p_i$ to the index-id
attribute for the relation Q of the query point is less than the
distance d. Lower bounds can be calculated for each cube that represent
a minimum distance between any point in a cube and the query point.
Thus, the step of comparing, step 19, is terminated when no lower bound
is less than a distance between the query point and any of the
candidate data points. At step 19, k candidate data points $p_i$ are
selected as k nearest-neighbors to the query point.

The following exemplary pseudocode in SQL illustrates a query according
to the present invention. Of course, other query languages can also be
used.

SELECT DISTINCT point-id
FROM R, Q
WHERE R.index-id = Q.index-id
AND R.value <= Q.value + d
AND R.value >= Q.value - d

The SQL query returns R(d;q). If the results are satisfying, the k
nearest neighbors can be found from the candidate set. Otherwise, a
larger candidate set can be obtained by increasing d. The new candidate
set will be a superset of R(d;q).

A set of candidate near neighbors is extracted from each B-tree by
simply selecting the identification of the points that are within, for
example, 30 positions from the query point in the tree. The list of
candidate identifications contains many duplicates, so the duplicate
points are eliminated. The number of points from each B-tree remaining
after eliminating duplicate points may then need to be increased so the
number of candidates is at least k.

As previously indicated, the set of candidates that is obtained as the
union of the sets of candidates provided by the various orderings is
not guaranteed to contain all the k nearest neighbors. In many cases an
extremely good approximation is achieved, either in terms of the
overlap with the set of the true k nearest neighbors, or in terms of
the actual distances between the query point and the reported neighbors
when compared to the distances between the query point and the set of
the true k nearest neighbors. The present invention also provides a
technique for finding the true k nearest neighbors. This extended
feature is used when it is essential to get the exact result. If an
exact output is desired, the search must be extended and a stopping
criterion must be defined so that when the algorithm stops it is
guaranteed to provide the exact result.

The method works as follows. Candidates are obtained by enlarging the
scanned intervals within each of the orderings. At the same time, lower
bounds are calculated for various cubes giving the minimum distance
between any point in the cube and the query point. The algorithm stops
the search as soon as available bounds imply that no unchecked database
point is closer to the query than any of the current k nearest
neighbors. The sub-cubes are stored in the same trees that store the
linear orders.

If $q \in R^n$ is a query point and

$$B = B(a,b) = \{\, x \in R^n \mid a_i \le x_i \le b_i,\ i = 1, \ldots, n \,\}$$

($a, b \in R^n$) is a rectangular box, then the Euclidean distance
between q and B can be computed as follows. The nearest point of the
cube is obtained by minimizing $\sum_{i=1}^{n} (q_i - x_i)^2$ subject
to $a_i \le x_i \le b_i$ ($i = 1, \ldots, n$). The objective function
is separable, so the optimal solution can be described as follows: (i)
If $q_i \le a_i$, then let $d_i = a_i - q_i$; (ii) If
$a_i \le q_i \le b_i$, then let $d_i = 0$; and (iii) If $q_i \ge b_i$,
then let $d_i = q_i - b_i$. The distance between q and B is given by
$\delta(q,B) = \|d\|$, where $d = (d_1, \ldots, d_n)$.

Suppose for a given q there are k points in the database that are
nearest to q that are desired to be obtained. Suppose further the
database points $p^1, \ldots, p^k$ have already been located as
candidates for the k nearest neighbors, such that

$$\|q - p^1\| \le \cdots \le \|q - p^k\|.$$

The distance $\|q - p^k\|$ is an upper bound on the distance to the kth
nearest neighbor. Thus, if some box B contains database points, and
$\delta(q,B) > \|q - p^k\|$, then there is no need to search box B
because none of its points can be among the k nearest neighbors.

Such bounds become useful in case an exact output is desired. The
orderings are maintained in trees whose nodes store information about
bounding boxes, and the ordering lends itself to creating good boxes.

Let $v^1, \ldots, v^N$ be all the database points. Consider one
ordering induced by a code C, so, without loss of generality, assume
that


$$C(v^1) < \cdots < C(v^N).$$

For each i . . 




What is claimed is: 

r N), use the representation 1. A method for determining k nearest-neighbors to a 

query point in a database, the method comprising the steps 
of: 



✓ = jjS- V 5 

Thus, the components $v^1_1, \ldots, v^N_1$ determine the first level
of the ordering. Recall that S(1) denotes the collection of the
sub-cubes at the first level of the hierarchy, so it has $2^n$
members, each with a volume of $2^{-n}$. Because N is expected
to be much smaller than $2^n$, most of the members of S(1) will
not contain any database point. The members of S(1) that do
contain such points can be stored at the leaves of a balanced
tree according to the order that is induced on them by the
codes $J(v^i_1)$, that is, the inverse of the Gray code. Due to the
nature of this code, if two sub-cubes are close in the
ordering, then there exists a relatively small box that contains
their union. For example, any two consecutive members form a box of
volume $2^{-n+1}$, and any two members that are at distance 2 in the
ordering are contained in a box of volume $2^{-n+2}$. In general, if
the two children of a node in the tree hold the boxes $B(a^1,b^1)$ and
$B(a^2,b^2)$, then the parent holds the box $B(a,b)$ where
$a_i = \min(a_i^1, a_i^2)$ and $b_i = \max(b_i^1, b_i^2)$. Thus,
relying on the currently known upper bounds on the distance to the kth
nearest neighbor, the present invention can determine not to proceed
into the subtree rooted at the node, if the box associated with it is
sufficiently distant from the query point.
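
The box bookkeeping described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the patented implementation: the names `Box`, `parent_box`, `box_distance` and `can_skip_subtree` are hypothetical, boxes are stored as plain coordinate tuples, and the pruning test uses the strict inequality between the box distance and the current upper bound on the distance to the kth nearest neighbor.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned box B(a, b) = {x : a_i <= x_i <= b_i} (hypothetical helper type)."""
    a: Tuple[float, ...]
    b: Tuple[float, ...]


def parent_box(left: Box, right: Box) -> Box:
    """Box held by a parent node: coordinate-wise min of a and max of b."""
    a = tuple(min(x, y) for x, y in zip(left.a, right.a))
    b = tuple(max(x, y) for x, y in zip(left.b, right.b))
    return Box(a, b)


def box_distance(q: Sequence[float], box: Box) -> float:
    """Euclidean distance from q to the nearest point of the box."""
    total = 0.0
    for qi, ai, bi in zip(q, box.a, box.b):
        if qi < ai:
            di = ai - qi      # q lies below the box in this coordinate
        elif qi > bi:
            di = qi - bi      # q lies above the box in this coordinate
        else:
            di = 0.0          # q lies within the box's extent
        total += di * di
    return math.sqrt(total)


def can_skip_subtree(q: Sequence[float], box: Box, kth_upper_bound: float) -> bool:
    """A subtree can be skipped when its box is farther than the current bound."""
    return box_distance(q, box) > kth_upper_bound


# Two adjacent first-level sub-cubes of [0,1)^2 and the box of their parent node.
left = Box((0.0, 0.0), (0.5, 0.5))
right = Box((0.0, 0.5), (0.5, 1.0))
parent = parent_box(left, right)
print(parent, box_distance((0.9, 0.25), parent))
```

Because the parent box contains both child boxes, its distance to the query point is never larger than that of either child, so skipping a parent safely skips its entire subtree.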

During the search for the exact k nearest neighbors, the
present invention maintains the following information. First,
there is a global upper bound on the distance between the
query point and the kth nearest neighbor. Second, in each
tree, there is the current interval in the corresponding
ordering all of whose points have been checked. Moreover,
in each such tree some of the nodes are marked as ones
whose subtrees should not be checked. The intervals are
expanded repeatedly, and the upper bound is updated. At the
same time, lower bounds are updated for a collection of
nodes having subtrees that cover the unchecked data points.
In each tree, the number of such subtrees at any time does
not exceed O(log n) where n is the number of data points.
The search may stop when in one tree all the lower bounds
of subtrees that cover the unchecked points are greater than
the global upper bound.
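
The stopping rule just described can be illustrated with a short Python sketch. It is a sketch under assumptions rather than the method as actually implemented: the class `KthDistanceBound` and the function `may_stop` are hypothetical names, the global upper bound is taken to be the kth smallest distance seen so far (infinite until k points have been checked), and the per-tree lower bounds are assumed to be supplied by the caller as plain lists.

```python
import heapq
from typing import List, Sequence


class KthDistanceBound:
    """Keeps the k smallest distances seen; the largest of them is the upper bound."""

    def __init__(self, k: int) -> None:
        self.k = k
        self._heap: List[float] = []  # max-heap emulated with negated distances

    def offer(self, dist: float) -> None:
        """Record the distance from the query point to a newly checked point."""
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -dist)
        elif dist < -self._heap[0]:
            heapq.heapreplace(self._heap, -dist)

    @property
    def upper_bound(self) -> float:
        """Upper bound on the distance to the kth nearest neighbor."""
        if len(self._heap) < self.k:
            return float("inf")
        return -self._heap[0]


def may_stop(lower_bounds_per_tree: Sequence[Sequence[float]], upper_bound: float) -> bool:
    """True if, in at least one tree, every subtree covering unchecked points
    has a lower bound greater than the global upper bound."""
    return any(
        all(lb > upper_bound for lb in bounds)
        for bounds in lower_bounds_per_tree
    )


bound = KthDistanceBound(k=2)
for d in [0.9, 0.4, 0.7, 0.2]:
    bound.offer(d)
print(bound.upper_bound, may_stop([[0.5, 0.6], [0.3, 0.9]], bound.upper_bound))
```

A single tree suffices for termination because the subtrees tracked in each tree cover all unchecked points, so if every one of them is farther than the bound, no unchecked point can improve the result.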

While the present invention has been described in connection
with the illustrated embodiments, it will be appreciated
and understood that modifications may be made
without departing from the true spirit and scope of the
invention.



defining an ordering for a data set P of a database, the
ordering being based on $l$ one-dimensional codes
$C_{\epsilon_1}, \ldots, C_{\epsilon_l}$;

creating a single relation R having attributes index-id,
point-id and value;

including an entry $(j, i, C_{\epsilon_j}(p_i))$ in relation R for
each data point $p_i \in P$, where index-id equals j, point-id
equals i, and value equals $C_{\epsilon_j}(p_i)$;

creating a B-tree index based on a combination of the
index-id attribute and the value attribute;

receiving a query point;

creating a relation Q for the query point having attributes
index-id and value;

generating one tuple in the relation Q for each j, $j = 1, \ldots, l$,
where index-id equals j and value equals $C_{\epsilon_j}(q)$;

selecting a distance d;

comparing the index-id attribute for the relation R of each
data point $p_i$ to the index-id attribute for the relation Q
of the query point;

selecting a candidate data point $p_i$ when the comparison of
the relation R of a data point $p_i$ to the index-id attribute
for the relation Q of the query point is less than the
distance d; and

selecting k candidate data points $p_i$ as k nearest-neighbors
to the query point.

2. The method according to claim 1, wherein the step of 
defining an ordering for a data set P of a database forms a 
plurality of cubes, the method further comprising the steps 
of: 

calculating lower bounds for each cube of the plurality of 
cubes, the lower bound representing a minimum dis- 
tance between any point in a cube and the query point; 
and 

terminating the step of comparing when no lower bound 
is less than a distance between the query point and any 
of the candidate data points. 

* * * * * 
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